A Framework to Adjust Dependency Measure Estimates for Chance
Abstract
Estimating the strength of dependency between two variables is fundamental for exploratory analysis and many other applications in data mining. For example, non-linear dependencies between two continuous variables can be explored with the Maximal Information Coefficient (MIC), and categorical variables that are dependent on the target class are selected using Gini gain in random forests. Nonetheless, because dependency measures are estimated on finite samples, interpreting their values and ranking dependencies accurately become challenging. Dependency estimates are not equal to 0 when variables are independent, cannot be compared if computed on different sample sizes, and are inflated by chance on variables with more categories. In this paper, we propose a framework to adjust dependency measure estimates on finite samples. Our adjustments, which are simple and applicable to any dependency measure, improve interpretability when quantifying dependency and accuracy when ranking dependencies. In particular, we demonstrate that our approach enhances the interpretability of MIC when used as a proxy for the amount of noise between variables, and improves accuracy when ranking variables during the splitting procedure in random forests.
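The idea of adjusting an estimate for chance can be illustrated with a minimal sketch. It assumes a permutation null (shuffling one variable to break any dependency) as the reference distribution, which is a common choice and not necessarily the exact procedure of the paper. The function names `gini_impurity`, `gini_gain`, and `chance_adjusted`, and the parameters `n_permutations` and `seed`, are hypothetical and chosen for illustration only.

```python
import random
import statistics
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a sequence of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(feature, target):
    """Reduction in Gini impurity of `target` after splitting on `feature`."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(target) - weighted


def chance_adjusted(measure, feature, target, n_permutations=200, seed=0):
    """Compare an estimate with its distribution under a permutation null.

    Assumption: shuffling `feature` approximates independence between the
    two variables while preserving the sample size and category counts.
    """
    rng = random.Random(seed)
    observed = measure(feature, target)
    shuffled = list(feature)
    null = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null.append(measure(shuffled, target))
    null_mean = statistics.fmean(null)
    null_std = statistics.stdev(null)
    return {
        "raw": observed,
        "null_mean": null_mean,
        # Centered value: close to 0 when the variables are independent.
        "centered": observed - null_mean,
        # Standardized value: comparable across features for ranking.
        "standardized": (observed - null_mean) / null_std
        if null_std > 0 else float("nan"),
    }
```

Under these assumptions, a feature with many categories that is unrelated to the class still obtains a positive raw Gini gain, while its centered value is pulled back toward 0. The standardized value gives a common scale for ranking features that differ in their number of categories.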
Similar resources
Design and adjustment of dependency measures
Dependency measures are fundamental for a number of important applications in data mining and machine learning. They are ubiquitously used: for feature selection, for clustering comparisons and validation, as splitting criteria in random forest, and to infer biological networks, to list a few. More generally, there are three important applications of dependency measures: detection, quantificati...
The meaning of kappa: probabilistic concepts of reliability and validity revisited.
A framework--the "agreement concept"--is developed to study the use of Cohen's kappa as well as alternative measures of chance-corrected agreement in a unified manner. Focusing on intrarater consistency it is demonstrated that for 2 x 2 tables an adequate choice between different measures of chance-corrected agreement can be made only if the characteristics of the observational setting are take...
Development and validation of the Sexual Knowledge and Attitude Scale
Abstract The main purpose of this study was to present an account of the development and examine psychometric properties of Sexual Knowledge and Attitude Scale (SKAS) including construct validity, convergent and discriminant validity, internal consistency, and test-retest reliability. Eight hundred and thirty seven Iranian men and women (385 men, 451 women) participated in this study, voluntari...
Data envelopment analysis in service quality evaluation: an empirical study
Service quality is often conceptualized as the comparison between service expectations and the actual performance perceptions. It enhances customer satisfaction, decreases customer defection, and promotes customer loyalty. Substantial literature has examined the concept of service quality, its dimensions, and measurement methods. We introduce the perceived service quality index (PSQI) as a sing...
Empirical estimates for various correlations in longitudinal-dynamic heteroscedastic hierarchical normal models
In this paper, we first define longitudinal-dynamic heteroscedastic hierarchical normal models. These models can be used to fit longitudinal data in which the dependency structure is constructed through a dynamic model rather than observations. We discuss different methods for estimating the hyper-parameters. Then the corresponding estimates for the hyper-parameter that causes the association...